Sains Malaysiana 52(8)(2023): 2431-2451

http://doi.org/10.17576/jsm-2023-5208-19

 

A New Single Linkage Robust Clustering Outlier Detection Procedures for Multivariate Data

(Suatu Prosedur Baharu Pengesanan Data Terpencil Berasaskan Pengelompokan Rangkaian Tunggal Teguh bagi Data Multivariat)

 

SHARIFAH SAKINAH SYED ABD MUTALIB1,2, SITI ZANARIAH SATARI1,* & WAN NUR SYAHIDAH WAN YUSOFF1

 

1Centre for Mathematical Sciences, College of Computing and Applied Sciences, Universiti Malaysia Pahang, Lebuhraya Tun Razak, 26300 Gambang, Kuantan, Pahang, Malaysia

2Faculty of Computer, Media and Technology Management, University College TATI, Jalan Panchur, Telok Kalong, 24000 Kemaman, Terengganu, Malaysia

 

Received: 3 January 2023/Accepted: 1 August 2023

 

Abstract

Outliers are abnormal observations, and detecting them in multivariate data has long been of interest. Unlike in univariate data, visual inspection alone is insufficient for detecting outliers in multivariate data. In this study, we developed a new single linkage robust clustering outlier detection procedure for multivariate data. A robust estimator, Test on Covariance (TOC), is used to robustify the similarity distance measure, producing a robust single linkage clustering. The performance of the new procedure is investigated via a simulation study using three outlier scenarios, with historical multivariate datasets as illustrative examples. Three performance measures are used: pout, pmask, and pswamp. The new procedure is also compared with single linkage clustering using Euclidean and Mahalanobis distances as similarity distance measures, as well as with TOC. The new procedure performs well in Outlier Scenario 3, in which both the mean and the covariance matrix are shifted. It also performs well on the datasets: it detects all outliers with no masking effect in two out of five datasets, and shows no swamping effect in any of them. In conclusion, the new single linkage robust clustering outlier detection procedure is a practical and promising approach for simultaneously identifying multiple outliers in multivariate data.

 

Keywords: Multivariate data; outliers; single linkage clustering; Test on Covariance; robust clustering
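The general idea summarised in the abstract, single linkage clustering on a covariance-scaled distance followed by flagging the small clusters, can be illustrated with a minimal sketch. This is not the authors' procedure: the TOC estimator is not defined in the abstract, so the covariance matrix is simply passed in as an argument (a robust estimate could be supplied there). The cut height (mean plus k standard deviations of the merge heights), the rule that the largest cluster holds the inliers, and all function names are assumptions made for illustration only.

```python
import numpy as np

def mahalanobis_pairwise(X, cov):
    """Pairwise Mahalanobis distances between rows of X for a given covariance."""
    VI = np.linalg.inv(cov)
    diff = X[:, None, :] - X[None, :, :]            # shape (n, n, p)
    d2 = np.einsum("ijk,kl,ijl->ij", diff, VI, diff)
    return np.sqrt(np.maximum(d2, 0.0))

def single_linkage_heights(D):
    """Single linkage merge heights, i.e. the edge weights of the
    minimum spanning tree of the distance matrix D (Prim's algorithm)."""
    n = D.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = D[0].copy()                               # distance of each point to the tree
    heights = []
    for _ in range(n - 1):
        masked = np.where(in_tree, np.inf, best)
        j = int(np.argmin(masked))
        heights.append(masked[j])
        in_tree[j] = True
        best = np.minimum(best, D[j])
    return np.array(heights)

def flag_outliers(X, cov, k=3.0):
    """Cut the single linkage tree and flag every point outside the largest cluster.
    ASSUMPTION: cut height = mean + k * sd of the merge heights (illustrative rule)."""
    D = mahalanobis_pairwise(X, cov)
    h = single_linkage_heights(D)
    cut = h.mean() + k * h.std(ddof=1)
    n = X.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    # Single linkage clusters at height `cut` are the connected components
    # of the graph that links every pair of points with distance <= cut.
    for start in range(n):
        if labels[start] >= 0:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.where((D[i] <= cut) & (labels < 0))[0]:
                labels[j] = current
                stack.append(int(j))
        current += 1
    main = np.bincount(labels).argmax()              # ASSUMPTION: largest cluster = inliers
    return labels != main

# Toy data: a 5 x 5 grid of inliers plus two distant points.
X_demo = np.array([[i, j] for i in range(5) for j in range(5)]
                  + [[20, 20], [20, 21]], dtype=float)
# The identity covariance makes the distance Euclidean; a robust estimate could go here.
print(np.where(flag_outliers(X_demo, cov=np.eye(2)))[0])   # prints [25 26]
```

Replacing the `cov` argument with a classical or a robust covariance estimate is what distinguishes the Mahalanobis and robust variants in this sketch; the clustering and flagging steps stay the same.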

 


 

 

*Corresponding author; email: sharifahsakinah84@gmail.com

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 
